Abstract

A neural network with fixed topology can be regarded as a
parametrization of functions, which determines the correlations
between functional variations when parameters are adapted. We
propose an analysis, based on a differential geometry point of view,
that allows these correlations to be calculated. In practice, this
describes how one response is unlearned while another is trained.
Concerning conventional feed-forward neural networks, we find that
they generically introduce strong correlations, are predisposed to
forgetting, and are inappropriate for task decomposition.
Perspectives for solving these problems are discussed.



I Introduction 

Following Kerns et al. (1995), the problem of model 
selection may be defined as follows: Given a finite 
set of data points, find a function (or conditional 
probability distribution, also called hypothesis) such 
that the expected generalization error is minimized. 
Typically, the search space $\mathcal{F}$ (the space of functions
or conditional probability distributions) is assumed
to be organized as a nested sequence of subspaces
$\mathcal{F}_1 \subset \dots \subset \mathcal{F}_d \subset \dots \subset \mathcal{F}$ of increasing complexity.
For instance, the index d may denote the number 
of parameters or the Vapnik-Chervonenkis dimension 
(Vapnik 1995). Finding the function with minimal 
generalization error then amounts to finding the ap- 
propriate sub-search-space before applying ordinary 
optimization schemes. Many approaches introduce a 
penalty term related to complexity which has to be 
minimized together with the training error. Penalty 
terms are, for example, the number of parameters of 
the model, the number of effective model parameters, 
the Vapnik-Chervonenkis dimension, or the descrip- 
tion length (Akaike 1974; Amari 1993; Moody 1991; 
Rissanen 1978; Vapnik 1995). An alternative based 
on geometric arguments is presented by Schuurmans 
(1997). 

The emphasis of our investigations differs from these classical
approaches. The choice of a specific model (e.g., a neural network)
to represent a function has two implications: it defines the space
$\mathcal{F}_d$ of representable functions, but it also defines a
parametrization of this space, where parametrization is not meant in
the sense of 'finding parameters' but in the sense of introducing
coordinates on that space, i.e., introducing a mapping
$\Phi : \mathbb{R}^m \to \mathcal{F}_d$ from some coordinate space
$\mathbb{R}^m$ onto the sub-search-space. To avoid confusion, we use
the term model class for the sub-search-space $\mathcal{F}_d$, and
model parametrization for the parametrization $\Phi$ of this
sub-search-space. For example, an artificial neural network with $m$
free parameters, fixed topology, and fixed activation functions
defines a model class (the subspace of functions it can realize,
which, if the topology is appropriate, includes an approximation of
any function (Hornik, Stinchcombe, & White 1989)) but it also defines
a model parametrization (the mapping from its parameters to the
corresponding function).

Our emphasis is on the implications of a specific 
model parametrization instead of the choice of a cer- 
tain model class. It is important to have a closer look 
at this parametrization in order to allow for an ana- 
lytical description of the adaptation dynamics, rather 
than just analyzing the complexity of a model class. 
In particular, the precise relation between variations of parameters
and functional variations of the system is of fundamental interest
because it determines, e.g., the way of "extrapolation" or how the
system forgets previously learned data. This relation can be derived
from the model parametrization and our goal is to extract such
features analytically. We focus on forgetting as a specific
characteristic of adaptation dynamics and develop an analysis of the
model parametrization that allows the rate of forgetting to be
approximated. This analysis is based on a differential geometry
point of view and is related to a large body of research, including
the discussions of cross-talk (Jacobs, Jordan, & Barto 1990) and
catastrophic forgetting (French 1999), the information geometry
point of view on parameter adaptation (Amari 2000), and perfectly
analogous ideas in the context of evolutionary adaptation
(Toussaint 2001). Section III includes a discussion of these
relations.

We apply our method of analyzing the model parametrization to the
class of standard feed-forward neural networks (FFNNs). We find that
the variety of FFNNs with arbitrary topology is actually not a great
variety with respect to certain characteristics of the model
parametrization. In particular, FFNNs generically introduce strong
correlations between functional variations and are thereby
predisposed to forget previously learned data. Hence, using FFNNs as
a function model means a limitation, not with respect to
representable functions but with respect to learning
characteristics. A simple example compares a standard FFNN with a
network that includes competitive interactions. The results validate
our analytical predictions and illustrate their implications. We
conclude that a generalization of the class of FFNNs is necessary
and that the introduction of competitive interactions between
neurons is a promising approach to solve these problems.

Section II introduces the formalism on which our investigations are
based and, in section III, we describe the analysis of the model
parametrization. Section IV presents the examples, and in section V
we give an outlook concerning the evolutionary perspective on model
selection and discuss the relevance of the limitations of FFNN
models. Section VI concludes.



II Definitions 

II.1 The functional point of view

Let $\mathcal{F}$ be the search space. Here, $\mathcal{F}$ shall be
the space of all functions mapping from a finite space $X$ to
$Y \subset \mathbb{R}^n$. However, all results can be transferred to
the search space of conditional probabilities, as we discuss below.

The space of functions $f : X \to Y$ can be written as $Y^X$, which
is isomorphic to $\mathbb{R}^{n \cdot |X|}$. Thus, let a function
$f \in Y^X$ be represented by $n \cdot |X|$ components
$f^a \in \mathbb{R}$, where the index $a$ refers to a specific point
in $X$ and a $Y$-dimension. (The components $f^a$ may be regarded as
entries of a lookup-table representation of $f$.) On this
representation, we describe an online adaptation step as a
probabilistic transition to a new function as follows: Assume that
adaptation is initiated by the observation of a target value $t^a$
for a functional component $f^a$. A transition occurs as a variation
$\delta f \in \mathbb{R}^{n \cdot |X|}$ with probability
$p(\delta f \mid f^a, t^a)$. The interesting point is that functional
components for which no target value has been observed may vary as
well. Let $a$ be a random variable and consider the density
$p(\delta f) = \sum_a p(\delta f \mid f^a, t^a)\, p(a)$. We will refer
to the respective covariance between two variation components as the
functional covariance matrix

$$C^{bc} := \mathrm{cov}_{p(\delta f)}(\delta f^b, \delta f^c). \qquad (1)$$

This matrix is a first-order description of how the adaptation of
the observed functional component results in a coadaptation of a
functional component which has not been observed. For example,
assuming a linear dependence between $\delta f^a$ and $\delta f^b$,
we have
$\delta f^b = \langle \delta f^b \rangle + \frac{C^{ab}}{\sigma^2} (\delta f^a - \langle \delta f^a \rangle)$,
where $\sigma^2$ is the variance of $\delta f^a$. Whether this
coadaptation is desirable or not depends on the problem.
Coadaptation is also an explicit description of the "way of
generalization"¹: unobserved functional components (i.e., the
functional response to stimuli that have not been observed) are
coadapted depending on the adaptation of observed functional
components. In general, one would like to choose from a variety of
different coadaptation schemes, i.e., one would like to select a
model from a variety of models with different kinds of coadaptation.
We will find that this corresponds to the selection of a model
parametrization.
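
As an illustration of the linear coadaptation relation above, the following minimal Python sketch estimates $C^{ab}$ and $\sigma^2$ from sampled variations and uses their ratio to predict $\delta f^b$ from $\delta f^a$. The sample data, the slope 0.5, the offset 0.2, and the helper names `covariance` and `coadaptation` are assumptions made purely for illustration; they are not taken from the experiments in this paper.

```python
import random


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def coadaptation(df_a, mean_a, mean_b, c_ab, var_a):
    """Linear prediction of the variation of component b given that of a."""
    return mean_b + (c_ab / var_a) * (df_a - mean_a)


rng = random.Random(0)
# Hypothetical variations: component b follows component a linearly.
df_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
df_b = [0.5 * x + 0.2 for x in df_a]

c_ab = covariance(df_a, df_b)    # functional covariance C^{ab}
var_a = covariance(df_a, df_a)   # variance sigma^2 of delta f^a
mean_a = sum(df_a) / len(df_a)
mean_b = sum(df_b) / len(df_b)
slope = c_ab / var_a             # strength of the coadaptation
```

A vanishing `slope` would mean that observing a target for component $a$ leaves component $b$ untouched on average, which is the situation that adaptation decomposition asks for.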

When the set of functional components can be separated into two
disjoint subsets such that $C^{ab}$ vanishes for any two components
$f^a$ and $f^b$ of different subsets, then we speak of adaptation
decomposition. During online learning, adaptation decomposition
means that the development of two such components during successive
adaptation is not correlated. In terms of homogeneous Markov
processes, successive adaptation is described by the transition
probability $p(\delta f \mid f^a, t^a)$ (assuming that the draw of
$a$ from $p(a)$ is independent at each time), and adaptation is
decomposed if
$p(\delta f^a, \delta f^b) = p(\delta f^a)\, p(\delta f^b)$.

II.2 The parameter point of view

We now address the modeling of functions. Let $\Phi$ be an
$m$-dimensional, differentiable parametrization of a subset
$\Phi(W)$ of functions:

$$\Phi : W \to \mathcal{F}, \quad W \subset \mathbb{R}^m, \qquad (2)$$
$$\Phi(W) := \bigcup_{w \in W} \{\Phi(w)\} \subset \mathcal{F}. \qquad (3)$$

We call $\Phi$ the model parametrization and $\Phi(W)$ the model
class. In terms of differential geometry, $\Phi$ is the inverse of a
coordinate map (or chart, or atlas) for $\Phi(W)$. Since this map is
differentiable, it induces a metric on $\Phi(W)$ if one on $W$ is
given and vice versa.

1 By "way of generalization" we do not refer to the generalization
error but to the way of extrapolation from observed data to
unobserved data.
We define the functional metric $g^{ab}(w)$ on $\Phi(W)$ as the lift
of the Euclidean metric on $W$,

$$g^{ab}(w) := \sum_i \frac{\partial \Phi(w)^a}{\partial w^i} \frac{\partial \Phi(w)^b}{\partial w^i}, \qquad (4)$$

and we define the parameter metric $g_{ij}(w)$ on $W$ (actually on
the dual tangent spaces of $W$) as the pull-back of the Euclidean
metric on $\Phi(W)$,

$$g_{ij}(w) := \sum_a \frac{\partial \Phi(w)^a}{\partial w^i} \frac{\partial \Phi(w)^a}{\partial w^j}. \qquad (5)$$

As usual in differential geometry, the metrics depend on the
locality given by $w$. These metrics describe the relation between
parameter variations and functional variations, as we explore in
more detail in the next section.

III Analysis of the model parametrization

In the previous section we defined the functional covariance matrix
$C^{ab}$ on the functional level. Now we analyze what the choice of a
model parametrization $\Phi$ implies on this functional level. Given
$\Phi$ and parameters $w$, we write $f^a = \Phi(w)^a$. Assume that a
target $t^a$ was observed and adaptation of the parameters takes
place by a gradient descent,

$$\delta w^i = 2\alpha \frac{\partial f^a}{\partial w^i} (t^a - f^a), \qquad (6)$$

which corresponds to the negative gradient of the squared error
multiplied by an adaptation rate $\alpha$. In first-order
approximation, this induces a functional variation

$$\delta f^b = \sum_i \frac{\partial f^b}{\partial w^i}\, \delta w^i = 2\alpha\, g^{ab} (t^a - f^a), \qquad (7)$$

using definition (4). Thus, the functional metric $g^{ab}$ describes
the variation of a functional component $f^b$ when $t^a$ is
observed. This gives a first-order description of coadaptation and
of how the model generalizes the experience of a target value $t^a$
in order to adapt also functional components $f^b$. In this
approximation the functional covariance reads

$$C^{bc} = 4\alpha^2 \sum_a p(a)\, g^{ba} g^{ca} (t^a - f^a)^2 - \langle \delta f^b \rangle \langle \delta f^c \rangle. \qquad (8)$$

To discuss this expression, let us assume that the second term
vanishes, $\langle \delta f^b \rangle \langle \delta f^c \rangle = 0$.
Concerning the first term, the product $g^{ba} g^{ca}$ vanishes for
all $a$ if and only if the functional metric is a block matrix and
$b$ and $c$ refer to different blocks:

$$g^{ab} = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \quad A \in \mathbb{R}^{\mu \times \mu}, \quad B \in \mathbb{R}^{\nu \times \nu}, \quad b \le \mu, \quad c > \mu,$$

where $A$ and $B$ are arbitrary symmetric matrices and
$\mu + \nu = n \cdot |X|$. Thus, adaptation is decomposed into two
subsets of functional components exactly if the functional metric is
a block matrix and the functional component subsets correspond to
these blocks.²
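
The first-order relation (7) can be checked numerically. The following sketch assumes a hypothetical three-parameter toy model `phi` with two functional components and a finite-difference Jacobian; neither is taken from this paper. It computes the functional metric of equation (4), performs the gradient step (6) for one observed component, and compares the actual functional variation with the prediction of equation (7).

```python
import math


def phi(w):
    """Hypothetical toy model: two functional components sharing parameters."""
    xs = (0.5, -1.0)  # two fixed input points, one component each
    return [w[2] * math.tanh(w[0] * x + w[1]) for x in xs]


def jacobian(f, w, eps=1e-6):
    """Central finite differences: J[a][i] approximates d f^a / d w^i."""
    n_out = len(f(w))
    J = [[0.0] * len(w) for _ in range(n_out)]
    for i in range(len(w)):
        wp = list(w)
        wp[i] += eps
        wm = list(w)
        wm[i] -= eps
        fp, fm = f(wp), f(wm)
        for a in range(n_out):
            J[a][i] = (fp[a] - fm[a]) / (2 * eps)
    return J


def functional_metric(J):
    """Equation (4): g^{ab} = sum_i J[a][i] * J[b][i]."""
    return [[sum(p * q for p, q in zip(ja, jb)) for jb in J] for ja in J]


w = [0.3, -0.2, 0.8]              # arbitrary parameter point
alpha, a, target = 1e-3, 0, 1.0   # adaptation rate, observed index, target t^a
f = phi(w)
J = jacobian(phi, w)
g = functional_metric(J)

# Equation (6): gradient step on the squared error of component a.
dw = [2 * alpha * J[a][i] * (target - f[a]) for i in range(len(w))]
f_new = phi([wi + di for wi, di in zip(w, dw)])
actual = [fn - fo for fn, fo in zip(f_new, f)]

# Equation (7): first-order prediction for every component b.
predicted = [2 * alpha * g[a][b] * (target - f[a]) for b in range(len(f))]
```

The entry `predicted[1]` is the coadaptation of the unobserved component; it is proportional to the cross-component `g[0][1]` and agrees with `actual[1]` up to terms of second order in the adaptation rate.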



III.1 Reference to related research

Cross-talk. The inspiring work by Jacobs et al. (1990) discusses the
implication of the choice of a multi-expert model on the learning
speed and generalization behavior. They formulate the idea of
spatial and temporal crosstalk, which denotes the statistical
dependence between the states of two different neurons or between
the states of a neuron at two different times. In our formalism,
this crosstalk is captured by the functional covariance: spatial for
two indices $a$ and $b$ belonging to the same input $x \in X$, and
temporal for two indices of different inputs. They argue that such
crosstalk may be undesirable and is avoided by explicitly separating
neurons into disjoint experts. As we will see below, selecting a
multi-expert model is a very intuitive way to explicitly declare an
independence of functional components and realize decomposed
adaptation. In fact, the separation into experts corresponds to a
functional metric of block matrix type. (If the gating is also
adaptive, the functional metric is actually not a completely clean
block matrix.)
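
The correspondence between disjoint experts and a block-structured functional metric can be made concrete with a small sketch. The two toy models below are assumptions chosen for illustration (they are not the multi-expert architecture of Jacobs et al.): in `shared_rows` both functional components read the same hidden unit, in `expert_rows` each component has its own parameters and there is no gating. The Jacobian rows are written out analytically and the functional metric is their Gram matrix.

```python
import math


def gram(rows):
    """Functional metric g^{ab} = sum_i J[a][i] * J[b][i] for Jacobian rows J."""
    return [[sum(p * q for p, q in zip(ra, rb)) for rb in rows] for ra in rows]


def shared_rows(w):
    """Jacobian of f^0 = w1 * tanh(w0), f^1 = w2 * tanh(w0) (shared hidden unit)."""
    h = math.tanh(w[0])
    s = 1.0 - h * h  # derivative of tanh
    return [[w[1] * s, h, 0.0],
            [w[2] * s, 0.0, h]]


def expert_rows(w):
    """Jacobian of f^0 = w1 * tanh(w0), f^1 = w3 * tanh(w2) (disjoint experts)."""
    h0, h1 = math.tanh(w[0]), math.tanh(w[2])
    return [[w[1] * (1.0 - h0 * h0), h0, 0.0, 0.0],
            [0.0, 0.0, w[3] * (1.0 - h1 * h1), h1]]


g_shared = gram(shared_rows([0.4, 0.7, -0.5]))
g_experts = gram(expert_rows([0.4, 0.7, -0.3, -0.5]))
```

Because the two experts have no parameter in common, every product in the off-diagonal sum contains a zero factor and `g_experts[0][1]` vanishes at every parameter point, whereas `g_shared[0][1]` is generally non-zero.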

In the context of artificial neural networks, the 
term catastrophic forgetting has been used to describe 
negative effects of coadaptation. See (French 1999) 
for a review. 



2 Note the relation to group theory: A group representation is said
to be reducible if all group generators can be represented as block
matrices (such that all of them fit in the same block template). On
this basis, physics defines the notion of an elementary particle as
corresponding to an irreducible representation, whereas physical
systems that correspond to a reducible representation (a block
matrix) are considered as composed of particles. A system whose
adaptation dynamics (instead of physical interactions) can be
decomposed in the sense of a block matrix can analogously be thought
of as composed of subsystems.

More formally, the observation of a target $t^a$ can be identified
with an element of a group that acts on the functional components.
Adaptation dynamics is now interpreted as successive application of
group elements. The group representation (i.e., the way the group
elements act on the functional components) is determined by the
model parametrization. If adaptation is decomposed, this
representation is reducible.






Proceedings of the International Joint Conference on Neural Networks (IJCNN 2002)



Information geometry. The methods applied in this paper are related
to information geometry. Let $Y = S_\nu = [0,1]^{2^\nu - 1}$ be the
$2^\nu - 1$ dimensional manifold of probability distributions over
$\{0,1\}^\nu$ as defined by Amari (2000). Then, the search space
$\mathcal{F}$ of mappings $X \to Y$ is the space of all conditional
probabilities $p(y|x)$, $x \in X$, $y \in Y$. Usually, one assumes
the Fisher metric on $\mathcal{F}$, not the Euclidean one. Thus, we
would have to change the definition (5) of the parameter metric into

$$g_{ij}(w) := E\left[ \frac{\partial \log p(x,y;w)}{\partial w^i} \frac{\partial \log p(x,y;w)}{\partial w^j} \right], \qquad (9)$$

where $E[\cdot]$ denotes the expectation and
$p(x,y;w) = p(y|x;w)\, p(x)$, $p(y|x;w) = \Phi(w) \in \mathcal{F}$.
Amari (1998) uses this metric to define the natural gradient descent
on the parameter space (which actually is the covariant derivative
instead of the contravariant). The use of the natural gradient can
also be motivated by a spatio-temporal decorrelation (Choi, Amari, &
Cichocki 2000).
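
To make the Fisher parameter metric concrete, the following sketch evaluates the expectation in equation (9) for a deliberately simple probabilistic model. All modeling choices here are assumptions made for illustration and are not part of this paper: a binary output with $p(y=1|x;w) = \mathrm{sigmoid}(w \cdot x)$, a uniform $p(x)$ over two example inputs, and a score computed by central finite differences of $\log p(x,y;w)$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_p(x, y, w, p_x):
    """log p(x, y; w) for the assumed model p(y=1 | x; w) = sigmoid(w . x)."""
    q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return math.log(p_x) + math.log(q if y == 1 else 1.0 - q)


def fisher_metric(w, inputs, eps=1e-6):
    """Equation (9): expectation of the outer product of the score vector."""
    m = len(w)
    p_x = 1.0 / len(inputs)  # uniform p(x), an assumption of this sketch
    g = [[0.0] * m for _ in range(m)]
    for x in inputs:
        q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for y, p_y in ((1, q), (0, 1.0 - q)):
            # Score: derivative of log p(x, y; w) with respect to each w^i.
            score = []
            for i in range(m):
                wp = list(w)
                wp[i] += eps
                wm = list(w)
                wm[i] -= eps
                score.append((log_p(x, y, wp, p_x) - log_p(x, y, wm, p_x)) / (2 * eps))
            for i in range(m):
                for j in range(m):
                    g[i][j] += p_x * p_y * score[i] * score[j]
    return g


inputs = [(1, 1, 0), (0, 1, 0)]  # two example binary inputs
w = [0.2, -0.4, 0.1]             # arbitrary parameter point
g = fisher_metric(w, inputs)
```

For this particular model the result can be compared with the closed form $g_{ij} = \sum_x p(x)\, q_x (1 - q_x)\, x_i x_j$ with $q_x = \mathrm{sigmoid}(w \cdot x)$; the third parameter never influences the output for these inputs, so its row and column of the metric vanish.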

Evolutionary computation. It seems that in the field of evolutionary
computation the discussion of the covariance structure in the search
space is much more elaborate than in the field of neural computation
(see Toussaint 2001). Roughly speaking, the goal of evolutionary
computation is to maximize the probability of good mutations during
evolutionary search. Eventually, fitness requires some phenotypic
traits to be mutated in correlation. Such correlations
(coadaptation) may be modeled explicitly in the search density of
evolutionary algorithms (Baluja & Davies 1997; Hansen & Ostermeier
2001; Mühlenbein, Mahnig, & Rodriguez 1999; Pelikan, Goldberg, &
Lobo 1999). Alternatively, they may be induced implicitly by the
choice of a good parametrization of phenotypic traits, i.e., by a
genotype-phenotype mapping, which is in perfect analogy to the model
parametrization $\Phi$. Many research efforts focus on the choice or
the understanding of the genotype-phenotype mapping (Stephens &
Waelbroeck 1999; Toussaint 2001; Wagner & Altenberg 1996). In this
view, functional components $f^a$ may be compared to phenotypic
traits, whereas parameters relate to the genotype.

IV Example 

Our test of the learning behavior is very simple: a regression of
only two patterns in $\{0,1\}^3$ has to be learned by mapping the
first pattern to $+1$ and the second to $-1$. However, we impose
that these patterns have to be learned online, where they alternate
only after each has been presented 100 times in succession.³ We test
two systems on this task: a standard feed-forward neural network as
described in detail in table 1, and a system that involves a softmax
layer as described in table 2. The parameters of both systems are
initialized randomly by the normal distribution
$\mathcal{N}(0, 0.1)$ around zero with standard deviation 0.1. The
two patterns were chosen as 110 and 010. Learning is realized by a
slow gradient descent with adaptation rate $2 \cdot 10^{-3}$ and
momentum 0.5. The metric components are calculated from the
gradients.

3 This task is not meant as a performance test but as an
experimental setup to test our analytical methods. However, similar
effects of learning and unlearning occur in online learning when a
specific response is unlearned during the course of training other
responses for several time steps. In real world simulations it is
also plausible that stimuli remain unchanged for many time steps.

Table 1: The Standard model

The feed-forward neural network we investigate here is
3-4-1-layered; layers are completely connected; the output neurons
are linear, the hidden ones implement the sigmoid
$\frac{1}{1 + \exp(-10 x)}$; only the hidden neurons have bias terms.

Table 2: The Softmax model

The softmax model is the same as the standard model with the
exception that the four neurons in the hidden layer compete for
activation: their output activations $y_i$ are given by

$$y_i = \frac{e^{30 x_i}}{X}, \quad x_i = \sum_{j \in \text{input}} w_{ij}\, y_j + w_i, \quad X = \sum_{i \in \text{hidden}} e^{30 x_i}. \qquad (10)$$

Here, $w_{ij}$ and $w_i$ denote weight and bias parameters. The
exponent factor 30 may be interpreted as a rather low temperature,
i.e., high competition. The calculation of the gradient is a little
more involved than ordinary back-propagation but straightforward and
of the same computational cost (see (Toussaint 2002)).

Please see Figures 1 and 2 for the results. For the standard neural
model we observe some forgetting of the untrained pattern during the
training of the other. For the softmax model, the error of the
untrained pattern hardly increases. The rate of forgetting, given by
the slope of the error curve, is well described by equation (7), as
demonstrated by the graphs in the middle. The bottom graphs display
the functional metric components and generally show that the
cross-component $g^{01}$, which is responsible for coadaptation and
forgetting, is quite large for the standard model compared to the
softmax model. Further, the softmax model seems to learn the
adaptation decomposition, as defined in section II, after the 200th
time step. All these results reveal that the standard model is not
well-suited to solve the simple task given and that the analysis of
the model's functional metric provides a formal way of understanding
this phenomenon. Remarkably also, the components $g^{00}$ and
$g^{11}$ become significantly greater than 1 during the training
phase of the respective functional component. By equation (7), this
means that the "effective" adaptation rate is larger than
$2 \cdot 10^{-3}$.

Figure 1: Test of the standard model.
For all four graphs the abscissa denotes the time step.
Top: The learning curves (errors) with respect to both patterns are
displayed. Only one of the patterns is trained, alternating every
100 time steps. The error of the untrained pattern increases.
Second: The slope (change of error per time step) of the untrained
learning curve is displayed. The dotted line refers to the measured
slope of the upper curve, the solid line is calculated according to
equation (7).
Third: The slope (measured and calculated) of the trained learning
curve.
Bottom: The three components of the functional metric $g^{00}$,
$g^{01}$, $g^{11}$ are displayed in logarithmic scale. In particular
the cross-component $g^{01}$ is clearly non-vanishing.

Figure 2: Test of the softmax model.
Top: The learning curves (errors) with respect to both patterns are
displayed. The untrained pattern is scarcely forgotten.
Second: The slope (measured and calculated) of the untrained
learning curve nearly vanishes.
Third: The slope (measured and calculated) of the trained learning
curve.
Bottom: The three components of the functional metric $g^{00}$,
$g^{01}$, $g^{11}$ (in logarithmic scale). The cross-component
$g^{01}$ is small; it decreases significantly at time step 200.

One might object that the results given above rely on the random
initialization and on the specific task we chose. To analyze both
types of models in a more general way we perform another test. We
investigate the distribution of the functional metric components
when parameters are normally distributed by $\mathcal{N}(0, 0.1)$.
Figure 3 shows the distributions for both models. Clearly, the
standard model exhibits a Gauss-like distribution of the
cross-component $g^{01}$ with mean around 1.5; a vanishing
cross-component $g^{01}$ is not very likely. On the other hand, the
softmax model exhibits two strong peaks at $g^{01} = 0$ and
$g^{01} = 1$, such that the probability for $g^{01} < 0.1$ is larger
than 10%. These distributions are generic properties of the two
models.
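
The functional metric components for the two models can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used for the figures: the gradients are obtained by central finite differences instead of back-propagation, the parameter layout in `unpack` and the random seed are arbitrary choices, and only a single random parameter draw is evaluated. The architecture (3-4-1, sigmoid steepness 10, softmax factor 30, linear output without bias) and the two patterns follow the description in the tables and text above.

```python
import math
import random

N_IN, N_HID = 3, 4


def unpack(w):
    """Split the flat vector into hidden weights, hidden biases, output weights."""
    W = [w[i * N_IN:(i + 1) * N_IN] for i in range(N_HID)]
    b = w[N_HID * N_IN:N_HID * N_IN + N_HID]
    v = w[N_HID * N_IN + N_HID:]
    return W, b, v


def net(w, x, softmax):
    """3-4-1 network with linear output; hidden layer is sigmoid or softmax."""
    W, b, v = unpack(w)
    pre = [sum(wij * xj for wij, xj in zip(W[i], x)) + b[i] for i in range(N_HID)]
    if softmax:
        top = max(pre)  # subtract the maximum for numerical stability
        e = [math.exp(30.0 * (p - top)) for p in pre]
        z = sum(e)
        hid = [ei / z for ei in e]
    else:
        hid = [1.0 / (1.0 + math.exp(-10.0 * p)) for p in pre]
    return sum(vi * hi for vi, hi in zip(v, hid))


def metric_components(w, patterns, softmax, eps=1e-6):
    """Functional metric g^{ab} of equation (4) via central differences."""
    grads = []
    for x in patterns:
        row = []
        for i in range(len(w)):
            wp = list(w)
            wp[i] += eps
            wm = list(w)
            wm[i] -= eps
            row.append((net(wp, x, softmax) - net(wm, x, softmax)) / (2 * eps))
        grads.append(row)
    return [[sum(p * q for p, q in zip(ga, gb)) for gb in grads] for ga in grads]


patterns = [(1, 1, 0), (0, 1, 0)]  # the two patterns 110 and 010
rng = random.Random(0)             # arbitrary seed
n_params = N_HID * N_IN + 2 * N_HID
w = [rng.gauss(0.0, 0.1) for _ in range(n_params)]

g_standard = metric_components(w, patterns, softmax=False)
g_softmax = metric_components(w, patterns, softmax=True)
print("standard g00, g01, g11:", g_standard[0][0], g_standard[0][1], g_standard[1][1])
print("softmax  g00, g01, g11:", g_softmax[0][0], g_softmax[0][1], g_softmax[1][1])
```

Repeating the last lines over many parameter draws and collecting `g[0][1]` in a histogram would give a rough counterpart of the distribution test described above; the values printed for a single draw depend on the seed and should not be read as representative.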

Figure 3: Distribution of metric components.
The distribution was calculated as a histogram of 1 million samples
using bins of fixed size. The ordinate is scaled in "percent of
samples that fell into the bin".
Top: The standard model. The probability of a vanishing
cross-component $g^{01}$ is very small.
Bottom: The softmax model. The inset graph is in logarithmic scale.
The probability of a vanishing cross-component $g^{01}$ is fairly
high.

V Toward evolutionary model selection

Finally, the question of how to select an appropriate model has not
yet been addressed. As discussed in the introduction, classical
approaches to model selection commonly introduce a penalty term in
order to reduce the model's complexity. Following this tradition we
could introduce a penalty term that reduces forgetting. Consider

$$\sum_{ab} (g^{ab})^2 - \sum_a (g^{aa})^2. \qquad (11)$$

This is a measure of the cross-components in the functional metric.
Unfortunately, we cannot present any experiments with this model
selection criterion here. This approach is postponed to future
research.
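
The penalty term (11) is simple to evaluate once a functional metric is available. The following sketch is only an illustration; the function name `cross_component_penalty` and the two example matrices are assumptions and do not come from any experiment in this paper.

```python
def cross_component_penalty(g):
    """Equation (11): sum of all squared entries minus the squared diagonal."""
    total = sum(entry ** 2 for row in g for entry in row)
    diagonal = sum(g[a][a] ** 2 for a in range(len(g)))
    return total - diagonal


# Hypothetical functional metrics of a decomposed and a coupled model.
block = [[2.0, 0.0], [0.0, 3.0]]
coupled = [[2.0, 1.0], [1.0, 3.0]]

penalty_block = cross_component_penalty(block)
penalty_coupled = cross_component_penalty(coupled)
```

The penalty is zero exactly when all cross-components vanish and grows with the squared off-diagonal entries, so it ignores how large the diagonal components (the effective adaptation rates) are.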

The original motivation for this work, though, was not to develop a
new model selection criterion as given by the above penalty term.
Instead we believe that the evolution of neural networks, which has
recently become an elaborate branch of research (see (Yao 1999) for
a review), is actually a promising method of model selection.
However, most of these approaches focus on standard neural models,
i.e., the evolutionary search space is the space of ordinary
feed-forward neural networks (FFNNs) with arbitrary topology. The
belief is that the variety of topologies offers a variety of
functionally different models. The present paper is a critique of
this belief because it supports the view that the functional metric
inherent in FFNNs comprises significantly non-vanishing
cross-components. This implies that the variety of FFNNs with
arbitrary topology is actually not a great variety with respect to
the functional metric. E.g., it hardly includes models with
vanishing cross-components and a low rate of forgetting. In
conclusion, the search space has to be generalized to also contain
models with arbitrary functional metric in order to allow for the
selection of better models. The presented softmax model involving
competitive interactions between neurons is a step in this
direction, but much motivation is left for future research toward
the generalization of the model search space and evolutionary
methods to select good models from this great variety. The model
presented in (Toussaint 2002) is one approach.

VI Conclusion 

We developed a new analytical approach to characterize a function
model and describe its learning properties. We focussed on
functional correlations in the adaptation process and derived the
relation to the functional metric of the model parametrization. The
analysis can in principle be applied to any kind of differentiable
model (also probabilistic, when formulated in terms of information
geometry). Our empirical studies illustrate the approach and
demonstrate that conventional neural network models are rather
limited with respect to their adaptation behavior: a task
separation, i.e., decorrelated adaptation to decorrelated data, is
hardly possible. In contrast, a model involving competitive
interactions is more predisposed for task decomposition. Thus, as we
pointed out in the previous section, the evolutionary approach to
model selection should generalize the search space to include not
only standard feed-forward neural networks, but also models with
arbitrary functional metrics, e.g., by allowing for competitive
interactions.